Which could well have looked like:
Ти, Стехо, здорова?
Чи cannot be used with wh-question words (‘how’, ‘why’ and so on):
– А ви чого, дурні, радієте? ‘But why are you happy, fools?’ [Олександр Кониський: Наймичка, 1874]
not:
– Чи ви чого, дурні, радієте?
Like its counterparts in the other languages under discussion, чи can also be used as a disjunctive connector (like whether … or in English).
The reasons for using this particle where it is not obligatory are not fully clear; however, there seem to be regional, temporal and other biases in its usage. In this analysis, I explore these possible interactions.
For this project, I used the GRAC corpus: [Maria Shvedova, Ruprecht von Waldenfels, Sergiy Yarygin, Mikhail Kruk, Andriy Rysin, Michał Woźniak (2017-2018): GRAC: General Regionally Annotated Corpus of Ukrainian. Electronic resource: Kyiv, Oslo, Jena. Available at uacorpus.org].
From there, I extracted questions using the following CQL query:
<s> []{1,15} [word =="?"]
Then I wrote a Python script to extract question particles such as А, Чи and Невже from the questions. A sample output:
| AUTHOR | TITLE | YEAR | COUNTRY | GENRE | REGION | CITATION | QWORD |
|---|---|---|---|---|---|---|---|
| Іван Карпенко-Карий | З Івана-пан; а з пана-Іван | 1884 | UA | DRA | UA-KRV; | А що ж зо мною буде ? | QWORD |
| Іван Карпенко-Карий | З Івана-пан; а з пана-Іван | 1884 | UA | DRA | UA-KRV; | Вже ж як ви знову пан , то я Іван і Оксанка вам не нужна ? | QWORD |
| Іван Карпенко-Карий | З Івана-пан; а з пана-Іван | 1884 | UA | DRA | UA-KRV; | Где же Иван ? | NOQWORD |
| Іван Карпенко-Карий | Чортова скала | 1884 | UA | DRA | UA-KRV; | А ти чого лежиш ? | QWORD |
| Марко Вовчок | Горпина | 1861 | UA | FIC | RU;UA-HRK;UA-CRG;UA-KYV;UA-VNC; | Чи ви ж доглядали її , батеньку ? | CZY |
| Марко Вовчок | Горпина | 1861 | UA | FIC | RU;UA-HRK;UA-CRG;UA-KYV;UA-VNC; | Скажіть же бо , що й як ? | QWORD |
| Марко Вовчок | Горпина | 1861 | UA | FIC | RU;UA-HRK;UA-CRG;UA-KYV;UA-VNC; | — А чого , дочко , мене лякаєш ? | QWORD |
| Марко Вовчок | Горпина | 1861 | UA | FIC | RU;UA-HRK;UA-CRG;UA-KYV;UA-VNC; | « Що се з нею подіялося ? | QWORD |
| Марко Вовчок | Горпина | 1861 | UA | FIC | RU;UA-HRK;UA-CRG;UA-KYV;UA-VNC; | Цілісінький день ходить мовчки та городній мак ізбирає ; а спитати , нащо ? | QWORD |
| Марко Вовчок | Данило Гурч | 1861 | UA | FIC | RU;UA-HRK;UA-CRG;UA-KYV;UA-VNC; | – А що таке ? | QWORD |
| Марко Вовчок | Інститутка | 1861 | UA | FIC | RU;UA-HRK;UA-CRG;UA-KYV;UA-VNC; | А ти , Прокопе , чому не йдеш ? | QWORD |
| Марко Вовчок | Інститутка | 1861 | UA | FIC | RU;UA-HRK;UA-CRG;UA-KYV;UA-VNC; | Чи , може , ця краля ? | CZY |
| Марко Вовчок | Інститутка | 1861 | UA | FIC | RU;UA-HRK;UA-CRG;UA-KYV;UA-VNC; | Лиха наша пані молода ? | NOQWORD |
| Марко Вовчок | Інститутка | 1861 | UA | FIC | RU;UA-HRK;UA-CRG;UA-KYV;UA-VNC; | — Яковось-то жилося тобі , серденько , самій ? | NOQWORD |
The first metric is the ratio of чи-questions to all other non-wh questions (czyratio). It was used for plots with the writer’s birth year as the time axis. The formula is:
\[\mathit{czyratio} = \frac{CZY}{NOQWORD + HIBA + NEVZHE + A}\]
CZY questions are deliberately left out of the denominator in order to amplify the effect.
The second metric is the more traditional share of чи-questions, czyperc. It was used for plots with individual book titles on the time axis. The formula is:
\[\mathit{czyperc} = \frac{CZY}{NOQWORD + HIBA + NEVZHE + A + CZY}\]
Note that 2CZY questions, those with чи as a disjunctive connector, are excluded from the denominators of both metrics, since чи is obligatory in that use.
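The two formulas can be illustrated with a short Python sketch. The function name `czy_metrics` and the example counts are hypothetical and are not taken from the corpus; the original calculations were done in R on the summary tables.

```python
def czy_metrics(counts):
    """Return (czyratio, czyperc) for a dict of question-type counts.

    Any 2CZY count is ignored in both metrics, as described in the text.
    """
    others = sum(counts.get(k, 0) for k in ("NOQWORD", "HIBA", "NEVZHE", "A"))
    czy = counts.get("CZY", 0)
    # czyratio: CZY is left out of the denominator
    czyratio = czy / others if others else float("nan")
    # czyperc: CZY is part of the denominator
    total = others + czy
    czyperc = czy / total if total else float("nan")
    return czyratio, czyperc


# Made-up counts, for illustration only
example = {"CZY": 20, "NOQWORD": 60, "HIBA": 10, "NEVZHE": 5, "A": 5, "2CZY": 7}
ratio, perc = czy_metrics(example)
print(round(ratio, 3), round(perc, 3))  # 0.25 0.2
```

With these counts, czyratio is 20/80 and czyperc is 20/100; unlike czyperc, czyratio can exceed 1 when чи-questions outnumber all the others.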
However, the data for writers of the previous generation (born around 1850) do not differ from those of their peers. This is rather surprising, as the sociolinguistic situation in western Ukraine was different from that in the eastern regions.
Regarding the writers from central Ukraine, the general downward trend is vividly illustrated by a sharp decrease in the share of чи-questions: from a common question particle in the 19th century, чи became a more marginal one in the 20th century. In my opinion, this pattern mirrors a similar process in the Russian language.
However, in works by authors from central or southeastern Ukraine this process appears to slow down or even stop around 1940, after which the share of questions with чи starts to rise gradually. The individual lifespan data of selected authors (see below) confirm that such a trend existed in the second half of the 20th century.
The scatterplot of czyperc in individual titles (books) against year shows a picture quite similar to the previous one, which is to be expected, as this metric is mainly a proxy for the author-level one.
The code used to build the table above:
giant_ukrdata_after45_notTransl_table <- as.data.frame.matrix(table (giant_ukrdata_after45[QWORD!="QWORD"&TRANSLATOR=="",QWORD,by=AUTHOR]), keep.rownames = FALSE)
Our data show that there is a regional and temporal bias in the use of the чи particle. The scatterplot below depicts the relationship between the author’s birth year and the proportion of questions with чи.
ggplot(writerstable[BIRTHYEAR!="NA"&BIRTHYEAR!="TRANSL"&macroreg!="NA"], aes(x=as.double(BIRTHYEAR), y=as.double(czyratio), color=macroreg)) + geom_point() + ylim(0,1) + stat_smooth(method = 'loess')
wgplot <- ggplot(writerstable[BIRTHYEAR!="NA"&BIRTHYEAR!="TRANSL"], aes(x=as.double(BIRTHYEAR), y=as.double(czyratio), color=macroreg)) + geom_point(aes(text=sprintf("Author:%s<br>Birthyear:%s<br>Region of birth:%s<br>macroreg:%s", rn, BIRTHYEAR, Birthreg,macroreg))) + ylim(0,1) + stat_smooth(method = 'loess')
ggplotly(wgplot)
Born before 1900
ggplot(writerstable[BIRTHYEAR!="NA"&BIRTHYEAR!="TRANSL"&macroreg!="NA"&BIRTHYEAR<1900], aes(x=as.double(BIRTHYEAR), y=as.double(czyratio), color=macroreg)) + geom_point() + ylim(0,1) + stat_smooth(method = 'lm')
Born after 1900
ggplot(writerstable[BIRTHYEAR!="NA"&BIRTHYEAR!="TRANSL"&macroreg!="NA"&BIRTHYEAR>1900], aes(x=as.double(BIRTHYEAR), y=as.double(czyratio), color=macroreg)) + geom_point() + ylim(0,1) + stat_smooth(method = 'lm')
Data:
# adding titleyear
all_merged_ALL[, titleyear := paste(TITLE, YEAR, sep = "-")]
yeartable_all_years_merged_ALL_titles <- as.data.frame.matrix(table(all_merged_ALL[QWORD!="QWORD"&TRANSLATOR==""&QWORD!=""&YEAR!="",QWORD,by=list(titleyear)]), keep.rownames = FALSE)
yeartable_all_years_merged_ALL_titles <- as.data.table(yeartable_all_years_merged_ALL_titles,keep.rownames = TRUE)
# renaming columns - unifying dataset
names(yeartable_all_years_merged_ALL_titles)[names(yeartable_all_years_merged_ALL_titles) == "titileyear"] = "titleyear"
# by title
yeartable_all_years_merged_ALL_titles_years <- all_merged_ALL[,unique(YEAR),by=list(titleyear,AUTHOR,GENRE)]
yeartable_all_years_merged_ALL_both <- merge(yeartable_all_years_merged_ALL_titles,yeartable_all_years_merged_ALL_titles_years,by="titleyear")
# something new
fwrtbl_part <- full_writerstable3_withyears[,c("rn","BIRTHYEAR","Birthreg","macroreg")]
names(fwrtbl_part)[names(fwrtbl_part) == "rn"] = "AUTHOR"
yeartable_all_years_merged_ALL_both2 <- merge(yeartable_all_years_merged_ALL_both,fwrtbl_part,by="AUTHOR")
titlestable_all <- yeartable_all_years_merged_ALL_both2
titlestable_all$N <- NULL
titlestable_all_u <- titlestable_all[!duplicated(titlestable_all[,c('AUTHOR','titleyear')]),]
# SAVE such a great dataset
write.csv(titlestable_all_u, file = "titlestable_all_u2.csv", row.names = F)
titlestable_all_u <- fread("titlestable_all_u2.csv", sep = ',')
# adding genre
genre_table_by_title <- all_merged_ALL[,unique(titleyear),by=list(GENRE)]
names(genre_table_by_title)[names(genre_table_by_title) == "PUBYEAR"] = "titleyear"
titlestable_all_u_bygenre <- merge(titlestable_all_u,genre_table_by_title, by="titleyear")
names(titlestable_all_u_bygenre)[names(titlestable_all_u_bygenre) == "GENRE.x"] = "GENRE"
# adding czyperc and correct sum - input table may be titlestable_all_u or titlestable_all_u_bygenre
titlevec_correct <- as.vector(titlestable_all_u$titleyear)
for (tttitle in titlevec_correct){
  ttl <- titlestable_all_u[titleyear==tttitle]
  no_czy_sum <- NULL
  no_czy_sum <- sum(ttl[1,4:9])
  czyperc_ttl <- ttl[1,5]/no_czy_sum
  titlestable_all_u[titleyear==tttitle, czyperc:=czyperc_ttl]
  titlestable_all_u[titleyear==tttitle, ALLSUM:=sum(titlestable_all_u[titleyear==tttitle,3:9])]
}
# remove anthologies with several authors
titlestable_all_u_bygenre <- titlestable_all_u_bygenre[titleyear!="Свідчення очевидців Голоду. Том ІІІ-1985"]
write.csv(titlestable_all_u, "titlestable_all_u2",row.names = F)
Interactive scatterplot of texts that contain, in total, more than 30 questions without wh-words and were written by authors born in Ukraine.
titlegplot <- ggplot(titletable[BIRTHYEAR!="NA"&BIRTHYEAR!="TRANSL"&macroreg!="NA"&ALLSUM>30], aes(x=as.double(PUBYEAR), y=as.double(czyperc), color=macroreg)) + geom_point(aes(text=sprintf("Author:%s<br>Birthyear:%s<br>macroreg:%s<br>Title: %s<br>Genre: %s", AUTHOR, BIRTHYEAR, macroreg, titleyear, GENRE))) + ylim(0,0.6) + stat_smooth(method = 'loess')
ggplotly(titlegplot)
Another interactive plot by title, without a regression line and with a relaxed filter (ALLSUM>10, that is, all texts with more than 10 non-wh-word questions), including authors with NA in “Birthreg”, who are mainly from outside Ukraine.
You can zoom into this plot using the “Zoom” or “Box select” tool.
The table below shows the genre distribution in our dataset. As noted above, in the “Preliminaries” section, the predominance of the Fiction genre and the consequent lack of observations for the other genres (in this case, writers who produced books in a given genre) do not allow us to draw a definitive conclusion. However, some genre bias is quite clear: academic works (including historical ones) and journalistic texts are more prone to use чи.
for (ggenre in c("ACA","CHI","DRA","FIC","HIS","JOU","MEM")){
glen <- length(table(all_merged_ALL[GENRE==ggenre,AUTHOR]))
print(paste("Number of authors in ", ggenre, " genre: ",glen,sep = ""))
}
## [1] "Number of authors in ACA genre: 12"
## [1] "Number of authors in CHI genre: 10"
## [1] "Number of authors in DRA genre: 21"
## [1] "Number of authors in FIC genre: 135"
## [1] "Number of authors in HIS genre: 13"
## [1] "Number of authors in JOU genre: 30"
## [1] "Number of authors in MEM genre: 19"
ggplot(titletable[GENRE!="LET"&GENRE!="NA"&GENRE!="PRE"&GENRE!="DIA"&GENRE!="ETH"], aes(x=as.factor(GENRE), y=czyperc)) + geom_boxplot(outlier.shape = 19)
Comments on the data files used
Below is information about the data files used in this research: both the original data file produced by the Python script (an extract is shown above) and the summary tables used for the plots (find the interactive plots below).
Author: The writer of a given text.
Title: The title of a given text.
Year: The publication year of a given text.
Country: The primary country of the writer’s activity. As of now, not used in the analysis.
Genre: The genre of a text. Fiction texts make up the main share of our database, while poetry is excluded from it completely.
Region: The main regions of the author’s activity, as filled in by the creators of the GRAC corpus.
Citation: The question sentence extracted from the GRAC corpus (KWIC). In the original output file there are many “false” duplicates, that is, identical strings for one title of a given author that were doubled due to an unidentified error in the GRAC engine. However, there were sometimes “true” duplicates, that is, identical questions found in one text, such as Невже це так? In our calculations, we used only unique question strings, omitting both “false” and “true” duplicates.
Qword: Type of the question string, as determined by the Python script.
1. QWORD - Any type of wh-word (‘how’, ‘where’, ‘why’ and so on) is found in the question string. Even if another question particle (чи, невже and so on) is present, the question is assigned to QWORD and then eliminated from our analysis.
2. NOQWORD - Question without any question particle (counted in our analysis).
3. CZY - One instance of the чи particle found in the question string.
4. 2CZY - Two or more instances of чи found in the question string. This usually indicates the use of чи as a disjunctive marker (‘or’), cf.:
Чи він тільки сам урятувався , чи ще хто відступив ? ‘Was he the only one who escaped, or did someone else manage to leave?’ (Петро Панч, Облога ночі)
As of now, this usage is completely omitted from our analysis.
5. HIBA, NEVZHE - Rhetorical question particles Хіба and Невже. As of August 2019 they have not yet been studied within our analysis. They cannot be used together with чи, which is why we count them as “NOQWORD” instances in our analysis.
6. A - Emphasis particle А, which can occasionally be used together with Невже or Хіба. In our analysis it is counted as “NOQWORD”.
ALLSUM: Sum of all non-wh-word questions, including the 2CZY questions with two чи, which are not included in any other calculation. This parameter is used to estimate the size of the book processed for question extraction.
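The classification rules listed above can be sketched in Python roughly as follows. This is not the original script, which is not shown in the text. The wh-word list is deliberately short and purely illustrative, and the tokenizer, the order of the Невже/Хіба/А checks and the treatment of А as a sentence-initial token are all assumptions; the function names are hypothetical.

```python
import re

# Illustrative, incomplete wh-word list (an assumption; the list used by the
# original script is not given in the text).
WH_WORDS = {"що", "як", "де", "коли", "чому", "чого", "куди", "навіщо", "нащо"}


def tokenize(question):
    """Lowercase the string and return its word tokens."""
    return re.findall(r"[^\W\d_]+(?:['’ʼ-][^\W\d_]+)*", question.lower())


def classify(question):
    """Assign one question string to a QWORD category."""
    tokens = tokenize(question)
    # A wh-word overrides any other particle in the string
    if any(t in WH_WORDS for t in tokens):
        return "QWORD"
    n_czy = tokens.count("чи")
    if n_czy >= 2:
        return "2CZY"
    if n_czy == 1:
        return "CZY"
    # Order of the remaining checks is an assumption
    if "невже" in tokens:
        return "NEVZHE"
    if "хіба" in tokens:
        return "HIBA"
    if tokens and tokens[0] == "а":
        return "A"
    return "NOQWORD"


def unique_questions(questions):
    """Drop repeated question strings, keeping the first occurrence."""
    return list(dict.fromkeys(questions))


sample = [
    "А ти чого лежиш ?",
    "Чи , може , ця краля ?",
    "Чи , може , ця краля ?",
    "Лиха наша пані молода ?",
]
for q in unique_questions(sample):
    print(classify(q), "|", q)
```

In practice `unique_questions` would be applied per title and author, so that identical questions in different texts are still counted.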
Find more information about the genre shares in our database
Table of the genre attribution of each question token (not text) in our database. These data were imported from the GRAC corpus.
ACA - Academic texts.
CHI - Children’s literature.
DIA - Diaries.
ETH - Ethnographical works (the only author here is Агатангел Кримський).
FIC - Fiction.
HIS - Historical literature.
JOU - Journalism.
LET - Letters.
MEM - Memoirs.
PRE - Speeches.
As shown above, the vast majority of the data belongs to FIC (Fiction), so no particular emphasis is placed on the GENRE feature, even though it is quite